row group
Back to DuckDB Data Engineering Glossary
Definition
A row group is a fundamental storage concept in columnar databases like DuckDB that represents a horizontal partition of data containing a fixed number of rows (122,880 rows by default in DuckDB). Rather than storing all data as one continuous block, the database breaks tables into these smaller chunks to optimize reading and processing.
Technical Details
Each row group contains metadata about the columns it includes, such as minimum and maximum values, which enables the database to skip reading entire row groups that aren't relevant to a query (known as predicate pushdown). Row groups also facilitate parallel processing since different groups can be processed simultaneously by different CPU cores.
Impact on Performance
When writing data to Parquet files using DuckDB, you can control the row group size using the ROW_GROUP_SIZE
parameter. Smaller row groups allow for more granular predicate pushdown but create more metadata overhead. Larger row groups improve compression ratios but may reduce parallelization opportunities. Here's an example of customizing row group size when writing to Parquet:
Copy code
COPY (SELECT * FROM my_table)
TO 'output.parquet'
(FORMAT PARQUET, ROW_GROUP_SIZE 50000);
Row groups are particularly important when working with Parquet files stored in cloud storage like S3, as DuckDB can selectively download only the row groups needed to answer a query, significantly reducing data transfer and query time.
Relationship to Memory Management
DuckDB's vectorized execution engine processes data in smaller chunks called vectors (2,048 values) within each row group. This approach allows DuckDB to maintain efficient memory usage even when working with large datasets, as only a portion of each row group needs to be in memory at any given time.